a) data selection, coreset selection, dataset pruning: select a subset of training data
- survey on coreset selection: https://arxiv.org/pdf/2505.17799
some papers on dataset pruning for generative models:
* Li, Yize, et al. "Pruning then reweighting: Towards data-efficient training of diffusion models." ICASSP 2025. IEEE, 2025.
* Moser, Brian B., Federico Raue, and Andreas Dengel. "A study in dataset pruning for image super-resolution." International Conference on Artificial Neural Networks. Cham: Springer Nature Switzerland, 2024.

dataset quantization: divide the training set into bins and select representative samples from each bin. It can itself serve as a data selection strategy (a minimal sketch follows).
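To make the bin-and-select idea concrete, below is a minimal sketch of dataset-quantization-style selection. It assumes samples are already embedded as feature vectors (e.g., CLIP features); the distance-based binning rule, bin count, and per-bin budget are illustrative assumptions, not details from the cited papers.

```python
import numpy as np

def quantize_select(embeddings: np.ndarray, n_bins: int = 10,
                    per_bin: int = 100) -> np.ndarray:
    """Bin samples by distance to the dataset mean (equal-frequency bins),
    then keep the samples nearest each bin's own centroid."""
    center = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - center, axis=1)
    edges = np.quantile(dists, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.digitize(dists, edges[1:-1])   # bin index in [0, n_bins)
    selected = []
    for b in range(n_bins):
        idx = np.where(bins == b)[0]
        if idx.size == 0:
            continue
        # Representative samples: nearest to this bin's own centroid.
        bin_center = embeddings[idx].mean(axis=0)
        order = np.argsort(np.linalg.norm(embeddings[idx] - bin_center, axis=1))
        selected.extend(idx[order[:per_bin]].tolist())
    return np.asarray(selected)

# Hypothetical usage: indices = quantize_select(clip_features, n_bins=20, per_bin=50)
```

The equal-frequency binning keeps both typical and atypical samples in the subset, which is the point of quantization over plain top-k scoring.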
data attribution: attribution scores can be used as a measure for data selection (see the sketch after this list).
- survey on data attribution: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5451054
some papers on data attribution for generative models:
* Georgiev, Kristian, et al. "The journey, not the destination: How data guides diffusion models." arXiv preprint arXiv:2312.06205 (2023).
* Zheng, Xiaosen, et al. "Intriguing properties of data attribution on diffusion models." arXiv preprint arXiv:2311.00500 (2023).
* Lin, Jinxu, et al. "Diffusion Attribution Score: Evaluating Training Data Influence in Diffusion Models." arXiv preprint arXiv:2410.18639 (2024).
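As one concrete way attribution could drive selection, here is a TracIn-style sketch that scores each training sample by the dot product between its loss gradient and a validation-loss gradient. It uses a generic supervised loss for readability (for a diffusion model the denoising objective would take its place); `model`, `loss_fn`, and the batches are hypothetical inputs, and this is not the exact method of any paper cited above.

```python
import torch

def attribution_scores(model, loss_fn, train_batch, val_batch):
    """Score each training sample by the alignment of its gradient
    with the validation-loss gradient (TracIn-style, single checkpoint)."""
    params = [p for p in model.parameters() if p.requires_grad]

    def flat_grad(loss):
        grads = torch.autograd.grad(loss, params)
        return torch.cat([g.flatten() for g in grads])

    xv, yv = val_batch
    val_grad = flat_grad(loss_fn(model(xv), yv))

    scores = []
    for x, y in zip(*train_batch):        # one sample at a time (slow but clear)
        g = flat_grad(loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)))
        scores.append(torch.dot(g, val_grad).item())
    return scores   # higher = sample's gradient pushes validation loss down

# Hypothetical usage: keep the top-k samples by score as the selected subset.
```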
b) dataset distillation: optimize the training set itself; the optimized training images are not realistic images. There is no existing work that uses distilled images to train a generative model (a gradient-matching sketch of the standard classification-setting recipe is below).
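For reference, the standard recipe in the classification setting is gradient matching (Zhao et al., "Dataset Condensation with Gradient Matching"). The sketch below is a toy version under strong assumptions: a fixed linear model, fixed labels, and a single matching loop; it illustrates why the learned images drift away from realistic ones, and is not a recipe for training generative models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill(real_x, real_y, n_syn=10, steps=200, lr=0.1):
    """Learn synthetic images whose gradients on a small model match
    the gradients produced by the real data."""
    syn_x = torch.randn(n_syn, *real_x.shape[1:], requires_grad=True)
    syn_y = real_y[:n_syn].clone()            # reuse real labels for simplicity
    opt = torch.optim.SGD([syn_x], lr=lr)
    model = nn.Sequential(nn.Flatten(),
                          nn.Linear(real_x[0].numel(), int(real_y.max()) + 1))
    params = list(model.parameters())

    for _ in range(steps):
        g_real = torch.autograd.grad(
            F.cross_entropy(model(real_x), real_y), params)
        g_syn = torch.autograd.grad(
            F.cross_entropy(model(syn_x), syn_y), params, create_graph=True)
        # Update the synthetic images so their gradients mimic the real ones.
        loss = sum(F.mse_loss(gs, gr) for gs, gr in zip(g_syn, g_real))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return syn_x.detach(), syn_y

# Hypothetical usage: syn_x, syn_y = distill(images, labels)
```

Because syn_x is optimized purely for gradient alignment, nothing constrains it to look like a natural image, which matches the observation above.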